Abstract

Today’s monolithic kernels often implement a small, fixed
set of policies, such as disk I/O scheduling policies, while
exposing many parameters that let users select a policy or
adjust its specific settings. Ideally, the exposed parameters
should be flexible enough for users to tune for good
performance. In practice, users lack domain knowledge of
the parameters and are often stuck with bad default
parameter settings.

We present EOS, a system that bridges the knowledge
gap between kernel developers and users by automatically
evolving the policies and parameters in vivo on users’ real,
production workloads. It provides a simple policy
specification API for kernel developers to programmatically
describe how the policies and parameters should be tuned,
a policy cache that makes in-vivo tuning easy and fast by
memoizing good parameter settings for past workloads,
and a hierarchical search engine that effectively searches
the parameter space. Evaluation of EOS on four main Linux
subsystems shows that it is easy to use and effectively
improves each subsystem’s performance.

1 Introduction 

A classic principle in OS kernel design is to separate
mechanisms and policies [?]. Specifically, kernel developers
build a small yet expressive set of mechanisms, on
top of which users can implement flexible policies optimal
for their workloads without re-implementing
the mechanisms. In today’s monolithic kernels such as
Linux, this principle manifests in a different form:
these kernels often implement a small, fixed set
of policies, while exposing many parameters for users
to select a policy or adjust the specific settings of a policy.
For example, Linux provides three I/O scheduling policies
named deadline, cfq, and noop (see §7.1 for details),
and users can write a policy’s name to
/sys/block/<disk name>/queue/scheduler to select the policy.
These policies each have between 0 and 12 parameters, and
users can write to
/sys/block/<disk name>/queue/iosched/<parameter name>
to change the values of the parameters.

Figure 1: EOS’s architecture.

Ideally, the parameters exposed by the kernels should
be flexible enough that users can tune them
to get good performance. Unfortunately, this ideal breaks
down in practice because of a knowledge gap between
kernel developers and users. The developers implement
the policies, so they know the effects of the parameters
well, but they do not know what workloads users will
run. They typically use workloads that matter to themselves
(e.g., make -j of the Linux kernel [?]) to set the
default policies and parameters. Users know their workloads
well but often lack a deep understanding of the kernel
internals. For example, the Linux completely fair scheduler
(CFS) has over 10 parameters with obscure names
tightly tied to the CFS algorithm and implementation. A
“brief” Linux performance tuning guide alone has several
hundred pages [25], most of which are just rules of thumb.
Even performance-tuning experts consider the tuning of
OS performance a black art [?]. Users thus
rarely tune the parameters, let alone tune them correctly,
and get stuck with the bad default policies and parameters
set by kernel developers. Our experiments show that these
default settings sometimes degrade performance by over 10
times (§7).

This paper presents EOS, a system that bridges the
knowledge gap between kernel developers and users by
automatically evolving the policies and parameters in vivo
on users’ real, production workloads. Key to EOS are three
new ideas, illustrated in Figure 1 and described below.

First, EOS provides a simple policy specification API
for kernel developers to programmatically describe how
the policies and parameters should be tuned while
implementing the policies. Specifically, they can use the API to
describe: (1) metadata that annotates the
parameters with additional information, such as where the
parameters are stored in memory (so that EOS can modify
them) and the value ranges of the parameters; (2) sensors
that capture the characteristics of the workloads, such as
the average I/O request size, which EOS uses to identify
workloads (see next paragraph); and (3) optimization
targets that developers intend to use to measure the system’s
performance, such as the number of I/O requests per unit of
time, so that EOS knows what to optimize for. This API helps
developers pass their domain knowledge to EOS so that
it can automatically tune the policies and parameters in
vivo.

Second, EOS provides a policy cache to make in-vivo
tuning easy and fast. It is often very time-consuming to
search a large policy and parameter space to find a good
setting. Fortunately, workloads are often stable or
repetitive over time. For instance, Wikipedia HTTP traces show
highly repetitive patterns every day [?], and a web proxy
I/O trace at MSR Cambridge shows highly stable
behavior [?]. Thus, once EOS finds a good policy and parameter
setting for a workload, it stores this setting in the policy
cache. The next time a similar workload arrives, EOS simply
reuses the cached setting. A second use of the policy
cache is to store intermediate results before the search
for a good setting for a workload is done. Since there are
many settings to search, EOS may find a good setting only
after the workload has repeated many times. EOS stores
the intermediate result in the policy cache so that next
time it does not have to restart the search from the very
beginning.

Third, EOS provides a hierarchical search engine to
effectively search the policy and parameter space. The
engine works as follows. At the top level, it uses a
simple, threshold-based algorithm to detect which kernel
component (e.g., I/O scheduling or page replacement) is
the bottleneck. Then, at the component level, it enumerates
the policies and, for each policy, searches
through different values of the parameters. For each policy
and parameter setting, it measures the system’s
performance, and it picks the best-performing setting. EOS
currently uses a greedy descent algorithm [?] by default
at this level, and allows developers to plug in their favorite
search algorithms.

We explicitly designed EOS to bridge the
knowledge gap between kernel developers and users; it
is orthogonal to the massive body of work on creating
search algorithms that find good or optimal settings in
a huge parameter search space. EOS aims to find a
good parameter setting that significantly improves
performance. The setting does not have to be optimal,
because finding an optimal setting often requires
sophisticated tuning algorithms and is extremely time-consuming.
We also explicitly designed EOS to handle stable or repeatable
workloads, not flash workloads, because flash workloads
occur rarely.

We implemented EOS in Linux. The policy specification
API consists of a set of C macros and functions. To
enable this API, developers simply include a header file. The
policy cache and the hierarchical search engine are
implemented as a dynamically loadable Linux kernel module
for flexibility of use. EOS’s current search algorithms are
tailored for local policies that do not tightly depend on
external environments such as networks, because these
environments are unpredictable. For instance, policies
in the network stack are out of scope for EOS.

We evaluated EOS on the policies in all four main
non-networking subsystems in Linux 3.8.8: disk I/O scheduling,
CPU scheduling, synchronization, and page replacement.
We augmented synchronization and page replacement
with state-of-the-art policies to provide a more thorough
evaluation of EOS. Results show that (1) EOS is easy
to use, requiring a couple of hours and 16 to 50 lines of
code to specify the policies in each subsystem, and (2)
it effectively improves each subsystem’s performance by
an average of 1.24 to 2.58 times and sometimes over 13
times.

The rest of this paper is organized as follows. §2
presents EOS’s design goals. §3 describes EOS’s policy
specification API, §4 EOS’s policy cache, and §5 EOS’s
hierarchical search engine. §6 discusses implementation
issues. §7 evaluates EOS’s performance, §8 discusses
related work, and §9 concludes.

2 Design Goals 

A key design goal in EOS is to find a parameter setting
with good performance; it does not aim to find the
absolute optimal setting. A massive body of work has been
devoted to finding optimal settings that maximize system
performance. However, current methods for finding optimal
settings are still too complex and too slow. We believe
this design goal makes EOS ready for practical use by
kernel developers and users. In addition, once advances are
made in the algorithms for finding optimal settings, EOS
can simply adopt them.

(a) One-week Wikipedia HTTP trace
(b) One-week MSR Web proxy I/O trace

Figure 2: Traces of two example real-world workloads. The Wikipedia trace is periodic, and the MSR trace is stable.


A second key design goal in EOS is to focus on relatively
steady, repeatable workloads, not flash workloads.
This design goal benefits EOS in two ways. First, steady,
repeatable workloads give EOS enough time to search
through many policy and parameter settings to find a good
setting. This search process may take some time, making
it difficult to keep up with flash workloads that last for
very short periods of time. Second, steady, repeatable
workloads increase the hit ratio of EOS’s policy cache,
improving its speed. Recall that the policy cache maps workloads
to settings that yield good performance on the workloads.
The more steady or repeatable a workload is, the more
likely EOS is to find a setting in this cache for the workload.

In practice, many workloads match this design goal.
Figure 2 shows traces of two such workloads, both
collected from real-world server environments. The first is
an HTTP trace of Wikipedia [14], one of the world’s most
popular websites. It covers a one-week period, randomly
picked from all traces in WikiBench [?], a realistic web
hosting benchmark. The trace is very periodic, repeating
a similar spiky pattern every 24 hours. The second
trace is an I/O throughput (measured in GB/hour) trace
of a Web proxy server at MSR Cambridge [?], obtained
from the widely used IOTTA trace repository [?]. The
trace is rather steady with few spikes. It is unsurprising
that many workloads, especially those in server
environments, are steady or repeatable, because events that cause
flash workloads are rare.

3 Policy Specification API 

Figure 3 shows EOS’s policy specification API. A key data
structure in the API is struct eos_param, which describes
the metadata of a parameter. It includes the unique name
of the parameter, the kernel subsystem this parameter
belongs to (for hierarchical search), the value range of the
parameter (for simplicity, EOS requires that parameters
have only unsigned long values), and how to adjust the value
of the parameter in search (linearly by adding or subtracting
1, or exponentially by doubling or halving). In addition,
this struct provides a setter and a getter method
for accessing the value of the parameter, and field param
points to where the parameter is in memory. As a shortcut,
developers can leave either or both of the methods NULL,
and EOS simply accesses param as a pointer to an unsigned
long. Once a kernel developer fills in this struct for a
parameter, she can register the parameter with EOS using
method eos_register_param.

struct eos_param {
    const char *name;    // unique name of the parameter
    const char *subsys;  // the subsystem this parameter belongs to
    unsigned long min_value, max_value;  // value range
    int linear_or_exponential_search;
    unsigned long (*getter)(void *param);              // optional
    void (*setter)(void *param, unsigned long value);  // optional
    void *param;         // where the parameter is stored in memory
};

struct eos_guard {
    struct eos_param *param;
    unsigned long value;
};

void eos_register_param(struct eos_param *param,
                        struct eos_guard *guard);

struct eos_subsys {
    const char *name;    // unique name of the subsystem
    void (*get_sensors)(int size, unsigned long *sensors,
                        int *similarity_threshold);
    unsigned long (*get_optimization_target)();
    int (*is_bottlenecked)();  // returns zero unless bottlenecked
};

void eos_register_subsys(struct eos_subsys *subsys);

Figure 3: EOS’s policy specification API.

// called to detect and register a block device
int blk_probe(void) {

    // register disk into system
    add_disk(disk);

    // [EOS] I/O scheduling policy for this disk
    struct eos_param *param_sched;

    param_sched = kmalloc(sizeof(struct eos_param), GFP_KERNEL);
    param_sched->name = ... // name based on disk->disk_name
    param_sched->subsys = "disk";

    param_sched->min_value = 0;  // three I/O scheduling policies:
    param_sched->max_value = 2;  // 0: deadline; 1: cfq; 2: noop

    eos_register_param(param_sched, NULL);

}

Figure 4: Annotating a disk parameter in EOS.

Figure 4 shows an example annotating a parameter in
the disk I/O scheduling subsystem. blk_probe is a Linux
function for detecting and registering disks. After a disk
is registered via add_disk(disk), developers use EOS’s API
to describe the parameters that control the I/O scheduling
policy for this disk. Since there are three disk I/O scheduling
policies, the minimum value and the maximum value
of this parameter are 0 and 2, respectively.

EOS provides a second struct, eos_guard, to capture the
dependencies between parameters. A parameter may be
active only when another parameter has a certain value.
For instance, the quantum parameter in the disk I/O subsystem
is active only when the scheduler parameter is set to
CFQ. Thus, it is useless to tune the quantum parameter unless
scheduler is set to CFQ. To express a parameter dependency,
a kernel developer passes an eos_guard when registering
a parameter. While in theory a parameter may depend
on multiple other parameters with constraints other
than equalities, in practice we never observed such a case
for any parameter in the four Linux kernel subsystems we
evaluated. Thus, EOS’s API supports specifying a dependency
on only one parameter with an equality constraint,
though augmenting it is easy.
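
As a minimal sketch of how such a guard might be expressed, the userspace C fragment below mirrors a subset of the structs in Figure 3 and registers a CFQ-only quantum parameter guarded by the scheduler parameter. The registry array, the helper eos_param_is_active, the parameter names, the quantum range, and direct reads through param (rather than a getter) are illustrative assumptions, not EOS’s actual implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace mirror of the Figure 3 structs (subset of fields). */
struct eos_param {
    const char *name;
    const char *subsys;
    unsigned long min_value, max_value;
    void *param;                 /* where the value lives in memory */
};

struct eos_guard {
    struct eos_param *param;     /* parameter this one depends on */
    unsigned long value;         /* value that activates the dependent */
};

/* Hypothetical registry: one slot per registered parameter. */
#define MAX_PARAMS 16
static struct { struct eos_param *p; struct eos_guard *g; } registry[MAX_PARAMS];
static int nr_params;

void eos_register_param(struct eos_param *param, struct eos_guard *guard)
{
    if (nr_params < MAX_PARAMS) {
        registry[nr_params].p = param;
        registry[nr_params].g = guard;
        nr_params++;
    }
}

/* Assumed helper: a parameter is active if it has no guard, or if the
 * guarding parameter currently holds the guard value. */
int eos_param_is_active(const struct eos_param *param)
{
    for (int i = 0; i < nr_params; i++) {
        if (registry[i].p != param)
            continue;
        if (!registry[i].g)
            return 1;
        return *(unsigned long *)registry[i].g->param->param
               == registry[i].g->value;
    }
    return 0; /* unregistered parameters are never searched */
}

/* Example: scheduler (0: deadline, 1: cfq, 2: noop) guards quantum. */
static unsigned long scheduler_value = 0;
static unsigned long quantum_value = 4;
static struct eos_param scheduler = { "sda.scheduler", "disk", 0, 2, &scheduler_value };
static struct eos_param quantum = { "sda.quantum", "disk", 1, 32, &quantum_value };
static struct eos_guard quantum_guard = { &scheduler, 1 };
```

With both parameters registered, quantum is skipped while the scheduler holds deadline or noop and becomes searchable once the scheduler is switched to cfq.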

EOS requires a third struct, eos_subsys, for supporting
a kernel subsystem. This struct includes three important
methods. First, method get_sensors collects the characteristics
of a workload, which the policy cache uses to
distinguish different workloads. The argument size specifies
the size of sensors and similarity_threshold, two arrays
that hold outputs from this method. sensors holds the
concrete values of the workload characteristics, such as
the average size of all I/O requests. similarity_threshold
specifies a percentage variance such that if two sensor
values are within the similarity threshold, they are
considered the same. (See §4 for more details.) Second,
method get_optimization_target returns an unsigned
long value representing the subsystem’s performance on the
workload, which EOS seeks to maximize when it searches
through different parameter settings. Kernel developers
can choose to let users provide workload-specific
optimization targets by providing a system call for setting the
optimization target. Lastly, method is_bottlenecked checks
whether the subsystem is the bottleneck right now. A typical
implementation of this method compares the percentage
of time the subsystem is busy with a threshold
(e.g., 80%).
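
The three methods can be sketched for a disk subsystem as follows. This is a userspace illustration under stated assumptions: the disk_stats counters, the single sensor, and the 20% similarity threshold are hypothetical, and only the struct layout and the 80% busy threshold come from the text.

```c
#include <assert.h>

struct eos_subsys {
    const char *name;
    void (*get_sensors)(int size, unsigned long *sensors,
                        int *similarity_threshold);
    unsigned long (*get_optimization_target)(void);
    int (*is_bottlenecked)(void);
};

/* Hypothetical counters a disk subsystem might keep per interval. */
static struct {
    unsigned long requests, bytes, sectors, busy_ms, interval_ms;
} disk_stats = { 200, 200 * 4096UL, 1600, 900, 1000 };

static void disk_get_sensors(int size, unsigned long *sensors,
                             int *similarity_threshold)
{
    if (size < 1)
        return;
    /* Sensor 0: average I/O request size in bytes. */
    sensors[0] = disk_stats.requests
                     ? disk_stats.bytes / disk_stats.requests : 0;
    similarity_threshold[0] = 20;   /* percent; assumed value */
}

static unsigned long disk_get_optimization_target(void)
{
    /* Sectors transferred per second over the last interval. */
    return disk_stats.sectors * 1000 / disk_stats.interval_ms;
}

static int disk_is_bottlenecked(void)
{
    /* Busy for more than 80% of the interval counts as bottlenecked. */
    return disk_stats.busy_ms * 100 > 80 * disk_stats.interval_ms;
}

static struct eos_subsys disk_subsys = {
    .name = "disk",
    .get_sensors = disk_get_sensors,
    .get_optimization_target = disk_get_optimization_target,
    .is_bottlenecked = disk_is_bottlenecked,
};
```

In the kernel, a struct like disk_subsys would then be handed to eos_register_subsys.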

Besides the data structures and methods described so
far, EOS’s policy specification API also provides syntactic
sugar to further ease the use of the API. The most useful
syntactic sugar is the macro eos_register_param_static; it
lets kernel developers register an eos_param struct for a
file-scope or global-scope parameter at compile time, as
opposed to calling eos_register_param at runtime, so that
developers can put this macro right next to the declaration
of a parameter. To implement this macro, EOS puts
compile-time registered eos_param structs in a special ELF
section in the Linux kernel or in a kernel module using
linker script tricks.
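
One common way to get this effect in userspace, shown here only as a sketch, is the section attribute of GCC and Clang together with the __start_/__stop_ symbols that GNU-compatible ELF linkers define for sections whose names are valid C identifiers. The macro arguments, the section name eos_params, the parameter values, and the use of a pointer table (which avoids padding between entries) are assumptions; the kernel implementation described above relies on linker scripts instead.

```c
#include <assert.h>
#include <string.h>

struct eos_param {
    const char *name;
    const char *subsys;
    unsigned long min_value, max_value;
    void *param;
};

/* Hypothetical macro: define a descriptor for a file-scope variable and
 * place a pointer to it in the "eos_params" ELF section. */
#define eos_register_param_static(var, subsys_name, lo, hi)              \
    static struct eos_param eos_param_##var =                            \
        { #var, subsys_name, lo, hi, &var };                             \
    static struct eos_param *eos_param_ptr_##var                         \
        __attribute__((used, section("eos_params"))) = &eos_param_##var

/* Defined by the linker at the section's start and end. */
extern struct eos_param *__start_eos_params[];
extern struct eos_param *__stop_eos_params[];

/* Parameters declared right next to their registration. */
static unsigned long read_ahead_kb = 128;
eos_register_param_static(read_ahead_kb, "disk", 0, 4096);

static unsigned long nr_requests = 128;
eos_register_param_static(nr_requests, "disk", 4, 1024);

/* Walk all compile-time registrations, as a loader might at init time. */
static int eos_count_static_params(const char *subsys)
{
    int n = 0;
    for (struct eos_param **p = __start_eos_params;
         p < __stop_eos_params; p++)
        if (strcmp((*p)->subsys, subsys) == 0)
            n++;
    return n;
}
```

The benefit of this pattern is that registration needs no init-time call per parameter: the loader discovers every descriptor by iterating over the section.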

4 Policy Cache 

The policy cache brings two benefits. First, it memoizes
good parameter settings and reuses them on similar
workloads, greatly reducing the time spent in searching
for good parameter settings. Second, it helps make the
search incremental. Specifically, when a workload runs
for a time shorter than what it takes for EOS to find a good
parameter setting, EOS stores the intermediate search
result in the policy cache, so that the next time a similar
workload runs, EOS can resume the search.

EOS maintains a sub-cache for each kernel subsystem
because a subsystem’s parameters are typically independent
of another subsystem’s. A sub-cache is organized
as a list of <workload signature, parameter
setting> pairs, where the workload signature captures
the characteristics of a workload and the parameter setting
makes the subsystem perform well on the workload. EOS
computes the workload signature by invoking the
subsystem’s get_sensors method.

To search a sub-cache to see if a workload exists, EOS
invokes get_sensors to compute the signature of the
workload, denoted s1. It then scans its list and compares the
signature with each signature s2 on the list to see if s1
is within the similarity threshold of s2. For instance, if
s1 has a field with value 8, and s2’s corresponding field has
value 10 and similarity threshold 20, then EOS considers
that these two fields have similar values because 8 is
within 20% of 10. If EOS determines that all fields of the
two signatures are similar, it considers the two workloads
similar and reuses the setting associated with workload s2
for workload s1. (There may be multiple entries matching
a given signature; EOS always returns the first match.)
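
The comparison above can be sketched in C as follows. The struct layout, the fixed sensor count, and the function names are assumptions made for illustration; the rule itself (every field of s1 within the cached threshold percentage of the corresponding field of s2, first match wins) follows the text.

```c
#include <assert.h>
#include <stddef.h>

#define EOS_MAX_SENSORS 8

struct eos_signature {
    int size;                                   /* number of sensor fields */
    unsigned long sensors[EOS_MAX_SENSORS];
    int similarity_threshold[EOS_MAX_SENSORS];  /* percent */
};

/* A field of s1 matches the cached field of s2 if it lies within
 * threshold percent of the cached value. */
static int field_similar(unsigned long v1, unsigned long v2, int threshold)
{
    unsigned long diff = v1 > v2 ? v1 - v2 : v2 - v1;
    return diff * 100 <= (unsigned long)threshold * v2;
}

static int signature_similar(const struct eos_signature *s1,
                             const struct eos_signature *s2)
{
    if (s1->size != s2->size)
        return 0;
    for (int i = 0; i < s1->size; i++)
        if (!field_similar(s1->sensors[i], s2->sensors[i],
                           s2->similarity_threshold[i]))
            return 0;
    return 1;
}

/* Return the index of the first cached signature similar to s1, or -1. */
static int subcache_lookup(const struct eos_signature *cache, int n,
                           const struct eos_signature *s1)
{
    for (int i = 0; i < n; i++)
        if (signature_similar(s1, &cache[i]))
            return i;
    return -1;
}
```

For the example in the text, field_similar(8, 10, 20) holds because the difference of 2 does not exceed 20% of 10.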

To store intermediate search results, EOS stores additional
information used by the search engine (§5) in addition
to the current parameter setting. When a similar
workload runs, EOS uses the additional information to
resume the search.

Implementation-wise, EOS limits the cache size to
1000 entries and replaces the entries that are least recently
used. In our evaluation, cache replacement never occurred for
any of the workloads. A 12-hour Wikipedia trace used
only 130 entries, the maximum of all evaluated workloads.
EOS persists the policy cache across reboots in
/var/eos/cache to save warm-up time.
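
A bounded cache with least-recently-used replacement can be sketched as below. To keep the example short it uses a single integer as a stand-in for both the workload signature and the parameter setting, matches signatures exactly rather than by similarity, and tracks recency with a logical clock; all of these are simplifying assumptions, and only the 1000-entry limit and the LRU policy come from the text.

```c
#include <assert.h>

#define CACHE_CAPACITY 1000   /* entry limit stated in the text */

struct cache_entry {
    unsigned long signature;   /* stand-in for a full workload signature */
    unsigned long setting;     /* stand-in for a parameter setting */
    unsigned long last_used;   /* logical timestamp of the last access */
    int valid;
};

static struct cache_entry cache[CACHE_CAPACITY];
static unsigned long clock_now;

/* Return the cached entry for a signature, refreshing its recency. */
static struct cache_entry *cache_get(unsigned long signature)
{
    for (int i = 0; i < CACHE_CAPACITY; i++)
        if (cache[i].valid && cache[i].signature == signature) {
            cache[i].last_used = ++clock_now;
            return &cache[i];
        }
    return 0;
}

/* Insert or update an entry, evicting the least recently used if full. */
static void cache_put(unsigned long signature, unsigned long setting)
{
    struct cache_entry *e = cache_get(signature);
    if (!e) {
        e = &cache[0];
        for (int i = 0; i < CACHE_CAPACITY; i++) {
            if (!cache[i].valid) {      /* free slot: use it directly */
                e = &cache[i];
                break;
            }
            if (cache[i].last_used < e->last_used)
                e = &cache[i];          /* oldest entry seen so far */
        }
    }
    *e = (struct cache_entry){ signature, setting, ++clock_now, 1 };
}
```

A linear scan is adequate at this size; a production cache would more likely pair a hash table with a recency list.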

5 Hierarchical Search Engine 

EOS’s hierarchical search engine operates at two levels:
it first detects which subsystem is bottlenecked and then
searches for a good parameter setting within the subsystem.
Once the subsystem is no longer a bottleneck, a second
subsystem may become bottlenecked, and EOS moves
on to tune the second subsystem. The rationale for focusing
on one subsystem at a time matches general performance
tuning experience: performance problems typically occur
when only one subsystem is bottlenecked. To detect
which subsystem is bottlenecked, EOS invokes each
subsystem’s is_bottlenecked method.

To search for a good parameter setting, EOS can in
principle leverage the algorithms proposed by the massive
body of prior work on performance tuning [?]. We
opted for an algorithm called orthogonal search [?] because
this algorithm is simple, finishes quickly, and finds
parameter settings with good performance. Operationally,
this algorithm iterates through the list of parameters and
finds the best value for each parameter. It then combines
these best values into the resultant parameter setting.
The intuitions are that (1) parameters are largely independent,
so they can be searched separately, and (2) the effects
of the parameters on performance are largely monotonic,
e.g., if increasing the value of a parameter improves
performance, then we should keep increasing the value.


We made two modifications to this algorithm to make it
more efficient within the context of EOS. First, our modified
algorithm prioritizes the more limited parameters,
those with a smaller number of possible values. The
intuition is that, since the number of possible values is
small, the difference in the values often has a large impact
on performance. Thus, once EOS finds a good value for a
limited parameter at the beginning of the search process, it
enjoys good application performance for the rest of the
search, improving the average application performance.
Second, our modified algorithm respects the dependencies
between parameters. Recall that kernel developers
can express when a parameter is active depending on another
parameter using EOS’s API (§3). Our algorithm does
not search a parameter if it is not active.
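
The modified search can be sketched as the following userspace C code. It is an illustration under several assumptions: every parameter (including a policy selector) is tuned by the same hill-climbing step, the width of the value range stands in for the number of possible values, and the three demo parameters and demo_measure are a made-up workload model used only to exercise the search.

```c
#include <assert.h>
#include <stdlib.h>

enum { EOS_LINEAR, EOS_EXPONENTIAL };

struct param {
    unsigned long value, min_value, max_value;
    int search;                 /* EOS_LINEAR or EOS_EXPONENTIAL */
    struct param *guard;        /* NULL, or parameter this one depends on */
    unsigned long guard_value;  /* active only if guard->value equals this */
};

typedef unsigned long (*measure_fn)(void);

static int is_active(const struct param *p)
{
    return !p->guard || p->guard->value == p->guard_value;
}

/* Next value in direction dir (+1 up, -1 down); returns the current
 * value unchanged when the range boundary is reached. */
static unsigned long step(const struct param *p, int dir)
{
    unsigned long v = p->value, next;
    if (p->search == EOS_LINEAR)
        next = dir > 0 ? v + 1 : (v > 0 ? v - 1 : v);
    else
        next = dir > 0 ? (v ? v * 2 : 1) : v / 2;
    return (next < p->min_value || next > p->max_value) ? v : next;
}

/* Keep moving one parameter in a direction while performance improves. */
static unsigned long climb(struct param *p, int dir, unsigned long best,
                           measure_fn measure)
{
    for (;;) {
        unsigned long old = p->value, next = step(p, dir);
        if (next == old)
            return best;
        p->value = next;
        unsigned long perf = measure();
        if (perf <= best) {
            p->value = old;     /* no improvement: undo and stop */
            return best;
        }
        best = perf;
    }
}

static int by_range(const void *a, const void *b)
{
    const struct param *x = *(struct param *const *)a;
    const struct param *y = *(struct param *const *)b;
    unsigned long rx = x->max_value - x->min_value;
    unsigned long ry = y->max_value - y->min_value;
    return (rx > ry) - (rx < ry);
}

/* Tune each active parameter in turn, most limited parameters first. */
static unsigned long orthogonal_search(struct param **params, int n,
                                       measure_fn measure)
{
    unsigned long best = measure();
    qsort(params, n, sizeof(*params), by_range);
    for (int i = 0; i < n; i++) {
        if (!is_active(params[i]))
            continue;
        best = climb(params[i], +1, best, measure);
        best = climb(params[i], -1, best, measure);
    }
    return best;
}

/* Hypothetical workload model, used only to exercise the search. */
static struct param sched = { 0, 0, 2, EOS_LINEAR, NULL, 0 };
static struct param quantum = { 4, 1, 64, EOS_EXPONENTIAL, &sched, 1 };
static struct param depth = { 128, 4, 1024, EOS_EXPONENTIAL, NULL, 0 };

static unsigned long demo_measure(void)
{
    unsigned long perf = sched.value == 1 ? 200 + quantum.value : 100;
    perf += depth.value <= 256 ? depth.value : 0;
    return perf;
}
```

Because the policy selector has the smallest range, it is tuned first, which also ensures that the guard on quantum is evaluated against the policy the search has already settled on.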

EOS periodically checks whether it should initiate a new
search process. In our current implementation, this period
is 15 minutes. When searching within a subsystem,
EOS checks the subsystem’s performance by calling
get_optimization_target 5 seconds after it activates a
setting, to allow the setting to stabilize.

6 Implementation 

We implemented EOS in Linux 3.8.8. The policy
specification API consists of a set of C macros and functions.
To enable this API, kernel developers simply include a
header file. The policy cache and the hierarchical search
engine are implemented as a dynamically loadable kernel
module, making it flexible for users.

To provide a more thorough evaluation of EOS, we further
modified Linux to add several additional policies to
two subsystems. We describe these modifications in the
next two subsections.

6.1 Synchronization Subsystem 

Linux uses the ticket spin lock as its basic low-level
synchronization policy. However, it performs poorly
when a lock is either heavily contended or seldom
contended [22]. To solve this problem, we made two
modifications to the spin lock implementation, described below.

The first modification adds a back-off to the ticket
spin lock. The pseudocode is shown in Figure 5. Instead
of constantly polling the current variable in the lock,
which causes cache line bounces under high contention, a new
lock requester pauses for C × N before each poll,
where C is the polling weight for each requester
and N is the number of current lock requesters. The
default value of C is zero.

typedef struct {
    int current;  // ticket currently being served
    int next;     // next ticket to hand out
} spinlock_t;

void spin_lock(spinlock_t *lock) {
    int t = fetch_and_inc(&lock->next);
    while (t != lock->current)
        pause(C * (t - lock->current));
}

void spin_unlock(spinlock_t *lock) {
    lock->current++;
}

Figure 5: Pseudocode for backoff-based ticket lock.
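
For readers who want to experiment, the pseudocode of Figure 5 can be approximated in portable C11 with stdatomic.h, as sketched below. The names ticket_lock_t, backoff_weight, and cpu_relax are assumptions for this example, and the busy loop is only a stand-in for the architecture-specific pause used in the kernel.

```c
#include <assert.h>
#include <stdatomic.h>

/* Polling weight C from the text; zero disables backoff. */
static unsigned int backoff_weight = 0;

typedef struct {
    atomic_uint current;   /* ticket now being served */
    atomic_uint next;      /* next ticket to hand out */
} ticket_lock_t;

static void ticket_lock_init(ticket_lock_t *lock)
{
    atomic_init(&lock->current, 0);
    atomic_init(&lock->next, 0);
}

static void cpu_relax(unsigned int iterations)
{
    /* Portable stand-in for a pause instruction. */
    for (volatile unsigned int i = 0; i < iterations; i++)
        ;
}

static void ticket_lock(ticket_lock_t *lock)
{
    unsigned int t = atomic_fetch_add_explicit(&lock->next, 1,
                                               memory_order_relaxed);
    for (;;) {
        unsigned int cur = atomic_load_explicit(&lock->current,
                                                memory_order_acquire);
        if (cur == t)
            return;
        /* Wait in proportion to the number of requesters ahead of us. */
        cpu_relax(backoff_weight * (t - cur));
    }
}

static void ticket_unlock(ticket_lock_t *lock)
{
    atomic_fetch_add_explicit(&lock->current, 1, memory_order_release);
}
```

Unsigned tickets make the subtraction t - cur well defined even after the counters wrap around.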

The second modification adds two new locking policies
to handle low and high levels of contention. These
two policies are: (1) the Test-and-Test-And-Set (TTAS) lock,
which implements double-checked locking [?]; and (2) the
MCS lock, which lets each lock requester spin on its own
cache line [?].

We call this new lock implementation the mixed
lock. Based on a global variable in the kernel
(method_tuner), the mixed lock dynamically switches
to a lock policy that copes with the current contention level.
This design is motivated by reactive locks [?].

The data structure of each mixed lock contains four
components: the data structures of the three locks
and a mode field indicating which lock policy this
mixed lock is currently using. Note that we cannot use
method_tuner to figure out the current lock policy of a
mixed lock, because the lock may already be held under
a different policy. All the lock-policy-specific data
structures are stored in the same cache line for high
performance. The mode variable is mostly read but seldom
written, so it is stored in a different cache line to reduce false
sharing.

The mixed lock interface consists of four functions:

enum release_mode acquire_mixed_lock(struct lock *lock,
                                     struct qnode *node);
enum release_mode acquire_mixed_trylock(struct lock *lock,
                                        struct qnode *node);
void release_mixed_lock(struct lock *lock, struct qnode *node,
                        enum release_mode mode);
void init_lock(struct lock *lock);

They acquire, try to acquire, release, and initialize a
mixed lock, respectively. The function acquire_mixed_lock
routes the lock request to a specific lock protocol (TTAS,
MCS, or backoff-based ticket) based on the mode variable of
the mixed lock. For each lock protocol, if the lock is acquired
successfully, the method_tuner variable is checked
to determine whether to switch to another protocol;
otherwise, the mode variable of the mixed lock is examined to
select the correct protocol to wait on. The return value of
acquire_mixed_lock indicates whether to switch to
a different lock protocol and is passed as a parameter to the
lock-releasing function.

When releasing a mixed lock, if the mode variable
indicates that we do not need to change the lock protocol, EOS
simply releases the lock by calling the releasing
function of the current lock protocol. If the protocol should
be switched to another protocol, EOS acquires the target
protocol, modifies the mode variable, invalidates the current
protocol, and finally releases the target protocol. In this
way, we ensure that only one valid lock protocol exists;
requesters at invalid protocols fail the request
and retry with the valid protocol.

For the three lock protocols in the mixed lock, we give
priority to the TTAS lock by setting it as the default
protocol, because most locks in the kernel are seldom heavily
contended. Thus, in function init_lock, the mode
variable is set to TTAS and the other locks are marked
invalid. The mixed lock takes a total of 300 lines of C code
to implement.

6.2 Page Replacement Subsystem 

Linux uses LRU2Q as the page replacement policy to
determine which pages should be reclaimed when there are
not enough free pages for future use. Specifically, it maintains
two LRU lists, one with pages accessed only once, and
the other with pages accessed more than once. When a
page needs to be evicted, LRU2Q considers pages on the
accessed-once LRU list first.

We implemented two more page replacement policies:
(1) Adaptive Replacement Cache (ARC) [35] and (2)
Clock with Adaptive Replacement and Temporal filtering
(CART) [20].

Besides the two LRU lists used in LRU2Q, ARC and
CART use two more lists (called nonresident lists) to
remember the pages that were recently reclaimed from
memory. When a new page is read into memory,
ARC or CART checks whether the page is remembered in
the nonresident lists and puts the page into the appropriate
LRU list. For instance, if the nonresident lists show that
the page was on the accessed-more-than-once LRU list,
ARC or CART adds the page directly to this LRU list,
bypassing the accessed-once list.
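
The placement decision described above can be illustrated with a deliberately simplified sketch. It keeps one fixed-size FIFO history instead of the two adaptive nonresident lists that ARC and CART maintain, and it implements only the example given in the text (a page last seen on the accessed-more-than-once list returns directly to it). The capacity, the page key, and the function signatures are assumptions.

```c
#include <assert.h>

enum lru_list { LIST_ONCE, LIST_MANY };

#define NONRES_SLOTS 64   /* assumed capacity of the nonresident history */

struct nonres_entry {
    unsigned long pfn;     /* page identity (hypothetical key) */
    enum lru_list from;    /* list the page was evicted from */
    int valid;
};

static struct nonres_entry nonres[NONRES_SLOTS];
static unsigned int nonres_head;   /* next slot to overwrite (FIFO) */

/* Record a page at eviction time. */
static void nonres_remember(unsigned long pfn, enum lru_list from)
{
    nonres[nonres_head] = (struct nonres_entry){ pfn, from, 1 };
    nonres_head = (nonres_head + 1) % NONRES_SLOTS;
}

/* Choose the LRU list for a page being read in: a page remembered as
 * evicted from the accessed-more-than-once list goes straight back to
 * it; any other page starts on the accessed-once list. */
static enum lru_list choose_list_on_fault(unsigned long pfn)
{
    for (int i = 0; i < NONRES_SLOTS; i++) {
        if (nonres[i].valid && nonres[i].pfn == pfn) {
            enum lru_list from = nonres[i].from;
            nonres[i].valid = 0;    /* forget: page is resident again */
            return from;
        }
    }
    return LIST_ONCE;   /* no recent history for this page */
}
```

The per-fault lookup and per-eviction bookkeeping shown here are the kind of extra work that makes maintaining nonresident lists a non-negligible overhead.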

According to previous research [20, 35], if we do not
consider the overhead of maintaining the nonresident lists,
both ARC and CART in theory always perform better than
the traditional LRU page replacement policy. However,
in practice, because the extra overhead of maintaining
the nonresident lists is not negligible, ARC and CART
can perform worse than LRU2Q (see §7.4 for details).

const struct pr_policy arc_pr_policy = {
    /* Initialize the structures. */
    .init_lruvec = init_lruvec_arc,

    /* Decide which pages to reclaim and do the reclaiming. */
    .get_scan_count = get_scan_count_arc,
    .shrink_lruvec = shrink_lruvec_arc,
    .balance_lruvec = balance_lruvec_arc,
    .should_continue_reclaim = should_continue_reclaim_arc,
    .too_many_isolated_compaction = too_many_isolated_compaction_arc,

    /* Capture activity and statistics. */
    .page_accessed = page_accessed_arc,
    .activate_page = activate_page_arc,
    .deactivate_page = deactivate_page_arc,
    .update_pr_statistics = update_pr_statistics_arc,

    /* Add/release pages from the LRU lists. */
    .add_page = add_page_arc,
    .release_page = release_page_arc,

    /* Helpers used for specific scenarios. */
    .rotate_inactive_page = rotate_inactive_page_arc,
    .reset_zone_vmstat = reset_zone_vmstat_arc,

    /* For nonresident lists. */
    .nonres_remember = nonres_remember_arc,
    .nonres_forget = nonres_forget_arc,
    .isolate = isolate_arc,
    .putback_page = putback_page_arc,
};

Figure 6: Functions implemented in ARC page replacement policy.

To support dynamic policy switching, we separated
the implementation of page replacement policies from the
page replacement mechanism. Specifically, we modified
and added a set of interfaces in the Linux kernel so that
EOS can switch the page replacement policies at runtime.
Figure 6 gives the functions we implemented in the ARC page
replacement policy. Kernel developers can easily develop
their own page replacement policies following the same
pattern as in Figure 6.

To switch the page replacement policy, we
implemented void pr_change(struct zone *zone,
enum pr_policy_list old_p, enum pr_policy_list new_p),
which moves all the pages in
the LRU lists of old_p to the LRU lists of new_p. After
that, the Linux kernel uses the new policy to govern
page replacement in memory.

In EOS, we set CART instead of LRU2Q as the default
policy because we use the nonresident list hit ratio, which
can only be collected in ARC and CART, as the workload
sensor (see §7.4 for details). One limitation of this
implementation is that once we switch the page replacement
policy to LRU2Q, we cannot switch to other policies.

Subsystem             Spec LOC   Average Speedup
Disk I/O scheduling   50         2.58×
CPU scheduling        36         3.20×
Synchronization       30         1.81×
Page replacement      16         1.24×

Table 1: Summary of results. The policies in each subsystem
require 16-50 lines of code to specify. EOS’s performance
improvements range from 1.24 to 2.58 times.

7 Evaluation 

We focus our evaluation on the following three questions: 

1. Is EOS easy to use? 

2. Can EOS find parameter settings that significantly
outperform the defaults?

3. Can EOS consistently benefit different kernel subsys¬ 
tems? 

Each of the next four subsections presents a case study
of applying EOS to one of the four main non-networking
subsystems in the Linux kernel: the disk I/O scheduling, CPU
scheduling, synchronization, and page replacement
subsystems. Before diving into the detailed results for
each subsystem, we summarize EOS’s performance in Table 1.
For every kernel subsystem evaluated, EOS improved the
subsystem’s performance by 1.24× to 3.20× on average.
In addition, to specify the policies, we only needed to
write 16 to 50 lines of code within a few hours; developers
of these subsystems can most likely spend less time
and annotate the parameters better. These results show
that EOS is easy to use and can effectively improve
performance.

7.1 I/O Scheduling 

Specifying policies. In the Linux I/O scheduling
system, we annotated four parameters previously reported
to have a large performance impact [28]. The first
parameter specifies which I/O scheduling policy to use for
a disk. The version of Linux we used supports three
policies: (1) completely fair queuing (“cfq”), (2)
deadline-based scheduling (“deadline”), and (3)
first-come first-served (“noop”). To change this parameter
(method setter in struct param), we used Linux’s function
elevator_change. The second parameter specifies the
maximum number of requests on a disk queue. The third
parameter specifies how many pages to read ahead. The last
parameter specifies the time slice allocated to each
process when the I/O scheduling policy is set to cfq. To
identify workloads, we used five sensors: the number of
concurrent processes, the read-write ratio of the requests,
the average I/O request size, the average seek distance
between two consecutive disk requests, and the average
time between two consecutive requests. These sensors
capture typical workload characteristics that significantly
affect performance. We set the optimization target to be
the number of sectors read or written per second. In total,
it took us roughly 2 hours and 50 lines of code to specify
the policies in the I/O subsystem.
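
For orientation, Linux exposes tunables of this kind as files under /sys/block/<disk>/queue. The sketch below is an illustration only, not EOS's setter code: the mapping from the four annotated parameters to the files `scheduler`, `nr_requests`, `read_ahead_kb`, and `iosched/slice_sync` is an assumption (the text does not name the files, and sysfs expresses read-ahead in KB rather than pages), and the names `IO_TUNABLES`, `apply_setting`, and `make_fake_sysfs` are hypothetical.

```python
import os
import tempfile

# Assumed mapping from the four annotated parameters to sysfs files under
# /sys/block/<disk>/queue; the paper does not name the files it tuned.
IO_TUNABLES = {
    "policy": "scheduler",              # cfq, deadline, or noop
    "queue_depth": "nr_requests",       # maximum requests on the disk queue
    "read_ahead": "read_ahead_kb",      # read-ahead size (KB in sysfs)
    "cfq_slice": "iosched/slice_sync",  # only meaningful when policy is cfq
}

def apply_setting(setting, disk="sda", root="/sys/block"):
    """Write each value in `setting` to its tunable file; return paths written."""
    written = []
    queue = os.path.join(root, disk, "queue")
    # Select the policy first: the contents of iosched/ depend on it.
    for name in sorted(setting, key=lambda n: n != "policy"):
        if name == "cfq_slice" and setting.get("policy") != "cfq":
            continue  # the parameter is inactive under other policies
        path = os.path.join(queue, IO_TUNABLES[name])
        with open(path, "w") as f:
            f.write(str(setting[name]))
        written.append(path)
    return written

def make_fake_sysfs(disk="sda"):
    """Build a scratch directory that mimics the layout, for dry runs."""
    root = tempfile.mkdtemp()
    os.makedirs(os.path.join(root, disk, "queue", "iosched"))
    return root
```

Passing `root=make_fake_sysfs()` exercises the logic without touching a real disk; writing to the real /sys/block requires root privileges.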

Experimental setup. We used an evaluation machine
with Ubuntu 13.04 server edition, a quad-core 3.6 GHz
Xeon E5-1620 processor, 32 GB memory, and two 2
TB hard disks. We used seven popular I/O benchmarks
and one 12-hour Wikipedia I/O trace as the evaluation
workloads. Specifically, the skip benchmark [39]
is a micro-benchmark that starts multiple processes, each
of which repeatedly reads 4 KB from a 2 GB file and
skips the following 4 KB. The vskip benchmark executes
skip in a KVM virtual machine to evaluate EOS’s
performance in a virtual environment. The zmIO benchmark
is another micro-benchmark that starts multiple
processes, each of which sequentially reads 64 KB from a
raw disk using direct I/O. The TPC-B benchmark
measures the number of executed transactions per second
for a database. We used PostgreSQL as the database
and ran TPC-B on another machine. The table under test
was much larger than the total memory size on our
evaluation machine. The TPC-H benchmark is a decision
support benchmark; we ran four processes issuing Q2
requests to a PostgreSQL database simultaneously. The
TPC-C benchmark evaluates OLTP applications by
simulating an environment where multiple users submit
transactions simultaneously into a database. The vTPC-C
benchmark executes TPC-C in a KVM virtual machine.
Parallel grep is a macro-benchmark that uses the GNU grep
utility to look for a non-existent string in the Linux
kernel 3.8.8 source. Eight processes are started
simultaneously, but each process calls grep on its own copy
of the Linux source code to reduce possible kernel lock
contention. (We only show results of running skip and
TPC-C in a virtual environment because the other workloads
showed no difference between virtual and native
execution.)
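
To make the access pattern of the skip micro-benchmark concrete, the following is a minimal sketch of one process performing one pass. It is an illustration written for this description, not the benchmark's source code; the names `skip_offsets` and `run_skip` are hypothetical, and the real benchmark starts multiple processes and repeats the reads.

```python
import os
import tempfile

BLOCK = 4096  # 4 KB, as in the description of skip

def skip_offsets(file_size, block=BLOCK):
    """Offsets read in one pass: read a block, skip the next block, repeat."""
    return range(0, file_size - block + 1, 2 * block)

def run_skip(path, block=BLOCK):
    """Perform one pass over `path` and return the number of bytes read."""
    total = 0
    size = os.path.getsize(path)
    # buffering=0 avoids Python-level read-ahead hiding the access pattern.
    with open(path, "rb", buffering=0) as f:
        for off in skip_offsets(size, block):
            f.seek(off)
            total += len(f.read(block))
    return total
```

Because every other block is skipped, one pass reads half of the file while still seeking across all of it, which is what makes the workload sensitive to read-ahead and scheduling settings.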

Results. Figure 7 shows the performance improvements
of the benchmarks compared to the default setting
(deadline in the version of Ubuntu we used, and the default
values of the three other parameters). With a cold cache,
EOS achieves over 11x speedup on skip and 3x speedup on
parallel grep. Its improvements on other applications are
also quite significant, up to 41%. With a warm cache, EOS
gains another 10.0% on average. Since we designed EOS to
handle relatively steady, repeatable workloads, we expect
the warm-cache case to be more common than the cold-cache
case.

Figure 8 presents EOS’s performance on a 12-hour
Wikipedia trace. It outperforms the default policy by over
19.8% for almost the entire workload.

Figure 7: EOS’s speedup of disk I/O scheduling subsystem
over default setting. EOS-cold represents results with a
cold policy cache, and EOS a warm cache. The last two
columns show average speedup of all workloads.

Figure 8: EOS’s speedup of disk I/O scheduling subsystem
on a 12-hour Wikipedia trace.

Figure 9 shows the parameter settings EOS selected for
each workload. Note that EOS may select more than one
setting over repeated executions when the settings have
roughly the same performance. The figure also shows that
there is no single scheduling policy that fits all
workloads.

Figure 10 shows the time it takes for EOS to find a good
setting for each workload. This time ranges from 105
seconds to 145 seconds.

Figure 9: Percentage of time a disk I/O scheduling policy
is selected by EOS.

Figure 10: Search time for each disk I/O scheduling
workload.

EOS uses workload signatures to search for settings in
the policy cache, so we also studied the differences and
similarities of the signatures of the evaluated workloads.
Figure 11 plots the average seek distance vs the average
time between two consecutive disk requests for the
workloads over repeated executions. These two sensors
suffice to place each workload in its own singleton cluster.
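
The role signatures play in a cache lookup can be sketched as a nearest-neighbor search over sensor readings. This is a simplified illustration under stated assumptions, not EOS's implementation: the class `PolicyCache`, the Euclidean distance, the per-sensor scales, and the match threshold are all hypothetical choices.

```python
import math

class PolicyCache:
    """Maps workload signatures (tuples of sensor readings) to settings."""

    def __init__(self, scales, threshold=0.1):
        self.scales = scales        # per-sensor normalization constants (assumed)
        self.threshold = threshold  # max normalized distance for a match (assumed)
        self.entries = []           # list of (signature, setting) pairs

    def _distance(self, a, b):
        # Euclidean distance after dividing each sensor by its scale.
        return math.sqrt(sum(((x - y) / s) ** 2
                             for x, y, s in zip(a, b, self.scales)))

    def lookup(self, signature):
        """Return the setting of the nearest cached signature, or None on a miss."""
        best = min(self.entries,
                   key=lambda e: self._distance(e[0], signature),
                   default=None)
        if best is None or self._distance(best[0], signature) > self.threshold:
            return None
        return best[1]

    def insert(self, signature, setting):
        self.entries.append((signature, setting))
```

With well-separated clusters such as those in Figure 11, a repeated workload lands close to its earlier signature and hits the cache, while an unseen workload misses and falls back to a search.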

7.2 CPU Scheduling 

Figure 11: Workload clusters based on two sensors:
average distance and average time between two consecutive
I/O requests.

Figure 12: EOS’s speedup of CPU scheduling subsystem
over default setting.

Specifying policies. In the CPU scheduling subsystem,
we annotated three parameters. Specifically,
sysctl_sched_latency_ns specifies the length of the period
in which each runnable process is scheduled once.
sysctl_sched_min_granularity_ns specifies the minimum
time a process is guaranteed to run when scheduled.
sysctl_sched_wakeup_granularity_ns describes the ability
of processes being woken up to preempt the current
process; a larger value makes it more difficult to preempt
the current process. To identify workloads, we used
one sensor: the number of retired instructions executed in
user space. We set the optimization target to be the same.
Overall, it took us 2 hours and 36 lines of code to specify
the policy.
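
How the first two parameters interact can be shown with a small calculation. The sketch below follows the period rule used by the CFS scheduler in kernels of that era, as far as we understand it, and is deliberately simplified: it assumes all tasks have equal weight and ignores the wakeup granularity. The function names are hypothetical, and the values in the usage note are illustrative rather than the defaults of the evaluation machine.

```python
def sched_period_ns(nr_running, latency_ns, min_granularity_ns):
    """Length of one scheduling period in which every runnable task runs once."""
    # Ceiling division: how many tasks fit in one latency period.
    nr_latency = -(-latency_ns // min_granularity_ns)
    if nr_running > nr_latency:
        # Too many tasks: stretch the period so each still gets the minimum.
        return nr_running * min_granularity_ns
    return latency_ns

def equal_share_slice_ns(nr_running, latency_ns, min_granularity_ns):
    """Time slice per task when all runnable tasks have the same weight."""
    period = sched_period_ns(nr_running, latency_ns, min_granularity_ns)
    return period // nr_running
```

For example, with a latency of 6 ms and a minimum granularity of 0.75 ms, 4 runnable tasks each receive 1.5 ms per period, while 16 tasks stretch the period to 12 ms and each receive 0.75 ms. Lowering either value therefore makes the scheduler preempt more often.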

Experimental setup. The evaluation machine is the
same as the one used to evaluate the disk I/O subsystem.
To demonstrate the performance improvements of
our system, we used the CPU scheduling benchmark in
SysBench, a widely used systems benchmark, and five
parallel computing benchmarks in the popular Cilk [9]
benchmark suite. SysBench starts multiple threads, each
of which repeatedly locks a mutex, yields the CPU, and
unlocks the mutex. The Cilk benchmarks run basic
algorithms such as merge sort and fast Fourier
transformation in parallel.

Results. Figure 12 shows EOS’s performance
improvements compared to the default. With a cold cache,
EOS worked well for most workloads and achieved more than
2x speedup on matmul. With a warm cache, it performed
even better, achieving a 13x speedup on matmul. To better
understand this huge speedup, we analyzed the
executions of matmul. It turns out that, during the
execution of matmul, sometimes many threads have no work to
do and are busy trying to steal work from other threads
without yielding the CPU. These futile work-stealing
requests waste many CPU cycles. In this scenario, EOS
correctly adjusted the CPU scheduling parameters to preempt
threads more often so that the few threads with work to do
can make good progress, significantly improving
performance.

It took EOS 200 seconds to finish the search for each
workload. (There is only one CPU scheduling policy in
the version of Linux we used, so EOS adjusted the
parameters of this policy only.) We also studied the
signatures of the workloads. Based on the signatures, the
workloads fall into four clusters: (1) SysBench, (2) fft,
(3) heat, and (4) the rest of the workloads.

7.3 Synchronization 

Specifying policies. In the synchronization subsystem,
we annotated two parameters. method_tuner specifies
which locking policy to use. val_tuner specifies the
polling weight in the back-off ticket spin lock (C in
Figure 5). val_tuner is active only when method_tuner is
set to the back-off based ticket lock. To identify
workloads, we used one sensor: the average lock acquisition
time (the time between when a lock is requested and when
it is granted). We used the number of lock acquisitions per
second as the optimization target. In total, it took us 2
hours and 30 lines of code to specify the policies in the
lock subsystem.
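
The dependency between the two parameters shapes the search space: val_tuner only adds candidates under one value of method_tuner. The following sketch enumerates such a two-level space and picks the best setting by exhaustive measurement. It is a stand-in for illustration, not EOS's hierarchical search engine; the function names, the policy labels, and the candidate values for val_tuner are hypothetical.

```python
def enumerate_settings(methods, dependent):
    """Yield every valid setting of a two-level parameter hierarchy.

    `methods` lists the values of the top-level parameter; `dependent` maps a
    method to the candidate values of the parameter that is active under it.
    """
    for method in methods:
        values = dependent.get(method)
        if not values:
            # No dependent parameter is active under this method.
            yield {"method_tuner": method}
        else:
            for v in values:
                yield {"method_tuner": method, "val_tuner": v}

def search(methods, dependent, measure):
    """Return the setting with the highest measured optimization target."""
    return max(enumerate_settings(methods, dependent), key=measure)
```

With three locking policies and three candidate weights for the back-off ticket lock, this yields five settings rather than nine, because the weight is never varied under policies that ignore it.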

Experimental setup. We used an evaluation machine
with Ubuntu 13.04 server edition and eight 48-core 1.9 GHz
AMD Opteron processors. We used four benchmarks as
workloads. Specifically, fops is a micro-benchmark
that measures the locking performance of the Linux
directory entry cache. It starts multiple processes, each
of which repeatedly opens and closes a private file. The
mmapbench benchmark is a micro-benchmark written by
us to measure the locking performance of the Linux
memory map operation. It spawns multiple processes, each
of which repeatedly maps the same contiguous 500 MB
in shared, read-only mode, reads the first byte, and
destroys the mapping. Parallel postmark is a
macro-benchmark that simulates file server workloads
hosting email and news services. It starts multiple
threads, each of which runs file system operations
repeatedly on an independent set of files (between 0.5 and
10K bytes in size). To measure locking instead of I/O
performance, we used tmpfs. dbench is a popular
macro-benchmark for benchmarking file systems. It starts
multiple processes, each of which does many file
operations such as read, write, link, and unlink. We also
used tmpfs for dbench.

Figure 13: EOS’s speedup of synchronization subsystem
over default setting. EOS-cold represents results with a
cold policy cache, EOS a warm cache, and unmodified the
results of the unmodified Linux spin lock implementation.

Results. Figure 13 presents EOS’s performance
improvements over the default. To create different lock
contention levels, we varied the number of processes or
threads used by each benchmark. For example, fops8
means running fops with eight processes. With a cold
cache, fops8, mmapbench, and parallel postmark
performed much better with EOS than the default (up to 13x).
With a warm cache, EOS performed even better. (Since
we modified Linux’s spin lock to support multiple locking
policies, our code adds some overhead. Thus, Figure 13
also shows the performance of Linux’s unmodified spin lock
for reference.)

Figure 14 shows the locking policy EOS selected for
each workload. For every workload, EOS consistently
selected the same policy over repeated executions. When
the contention level is low (fops2), it selected TTAS.
When the contention level is medium (fops3 and dbench),
it selected the back-off based ticket lock. When the
contention level is high (fops8, mmapbench, and parallel
postmark), it selected MCS. The figure also shows that
there is no single locking policy that fits all workloads.

Figure 14: Percentage of time a lock policy is selected by
EOS.

It took EOS 15 seconds to finish the search for fops2,
fops8, mmapbench, and parallel postmark; and 75
seconds for fops3 and dbench. We also studied the signatures
of the workloads. Based on the signatures, each workload
falls into its own singleton cluster.

7.4 Page replacement 

Specifying policies. In the Linux memory management
system, we annotated one parameter that specifies which
page replacement policy to use for memory. To identify
workloads, we used one sensor: the hit ratio of the
nonresident lists. This sensor characterizes a
performance-critical memory access pattern of a workload.
Specifically, when a hit occurs on a page on a nonresident
list, the kernel can use access history stored in the list
to effectively improve performance. We set the optimization
target to be the number of page-in or page-out operations
per second. In total, it took us roughly 2 hours and 16
lines of code to specify the policies in the memory
management subsystem.
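
The idea behind this sensor can be illustrated with a toy simulation. The sketch below keeps a plain LRU resident set plus a bounded list of recently evicted page numbers and reports the fraction of page faults that hit that list. It is an assumption-laden illustration: ARC and CART maintain more elaborate lists, the text does not give the exact definition of the ratio, and the name `nonresident_hit_ratio` is hypothetical.

```python
from collections import OrderedDict

def nonresident_hit_ratio(trace, resident_size, ghost_size):
    """Fraction of page faults that hit a page remembered on the ghost list."""
    resident = OrderedDict()  # pages in memory, least recently used first
    ghost = OrderedDict()     # recently evicted page numbers (no data kept)
    faults = ghost_hits = 0
    for page in trace:
        if page in resident:
            resident.move_to_end(page)  # refresh recency on a hit
            continue
        faults += 1
        if page in ghost:
            ghost_hits += 1
            del ghost[page]
        if len(resident) >= resident_size:
            victim, _ = resident.popitem(last=False)  # evict the LRU page
            ghost[victim] = None
            if len(ghost) > ghost_size:
                ghost.popitem(last=False)             # forget the oldest entry
        resident[page] = None
    return ghost_hits / faults if faults else 0.0
```

A loop slightly larger than memory, such as pages 1, 2, 3 repeated with room for two, keeps faulting on pages that were just evicted and so shows a high ratio, whereas a one-time streaming scan shows a ratio of zero.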

Experimental setup. We used an evaluation machine
with Ubuntu 13.10 desktop edition, a quad-core 3.4 GHz
i7-2600 processor, 4 GB memory, and one 1 TB hard disk.
We used one synthetic micro-benchmark and six
memory-intensive benchmarks from the big data and
scientific computing areas. Specifically, WordCount and
TeraSort are example programs in the Hadoop package.
WordCount counts the number of distinct words that appear
in a 10 GB file. TeraSort sorts a large number of numbers
(overall size is 10 GB) using merge sort. Join [48] is one
of the popular big data benchmarks developed based on
Hive [46]. It does a join operation on two large tables in
a MySQL database. Heat simulates the heat distribution
over time on a metal plate. Stencil does 9-point
iterative stencil computing. Sor [24] is the successive
over-relaxation algorithm. Synthetic is a micro-benchmark
we developed; it requests 3.8 GB of memory and writes
random data to it in round-robin fashion 10 times.
The data sets of all the above benchmarks are larger than
the memory size so that page replacement happens
frequently.

Figure 15: EOS’s speedup of the page replacement
subsystem over default setting.

Results. Figure 15 shows the performance
improvements of the benchmarks compared to the default
setting. With a cold cache, EOS achieves 1.74x speedup
on synthetic. With a warm cache, EOS gains another
72% on synthetic. On the real-world application
Sor, EOS achieves 1.43x speedup with a cold cache and
1.48x speedup with a warm cache. On average,
EOS speeds up the applications by 1.24 times.

For every workload, EOS consistently selected the same
policy over repeated executions. It selected CART for
Heat, Stencil, Sor, and synthetic, and the
default LRU2Q for WordCount, TeraSort, and Join. (ARC
is not selected for any workload.) The figure also shows
that there is no single page replacement policy that fits
all workloads.

It took EOS 15 seconds to finish the search for each
workload. We also studied the signatures of the workloads.
Based on the signatures, the workloads fall into three
clusters: (1) Heat, (2) Stencil and Sor, and (3) WordCount,
TeraSort and Join.

8 Related Work

A massive body of work has been devoted to improving
performance. However, to our knowledge, no prior work
has proposed a programming API for kernel developers
to describe the policy parameters; nor has any
prior work proposed a general policy cache for
memoizing good parameter settings and reusing them on
similar workloads. Below we compare EOS to the closely
related work.

Performance auto-tuning. Self-adapting operating
systems aim to collect traces from real workloads and
run simulations with the traces to adapt the operating
system implementation, not just the parameters. While this
goal is exciting, we are not aware of any implementation
of such a system.

Many performance tuning algorithms have been
proposed, some of which may be incorporated into EOS’s
search engine for searching a good parameter setting
within a kernel subsystem. For example, IBM researchers
use genetic algorithms to search for the optimal parameter
setting for the Anticipatory I/O scheduler and the Zaphod
CPU scheduler. Using several synthetic benchmarks,
they improved performance by 9% on average. EOS
improved performance much more, potentially because it can
(1) cache prior tuning results and (2) tune not only the
specific settings of a policy, but also which policy to use.

Borrowing from control theory, feedback-based tuning
algorithms iteratively adjust parameter settings based on
some form of feedback. Reactive lock dynamically
selects the correct locking protocol based on the lock
contention level; our lock implementation is motivated by
reactive lock and further generalizes it by adding a
back-off based ticket lock for mid-level contention.
Application heartbeat [29] is one mechanism for
applications to provide custom feedback. It has been shown
to let Linux’s CFS scheduler deliver predictable
performance and temperature. The Berkeley Tessellation
kernel uses feedback control to allocate an optimal number
of cores for an application. Feedback-based algorithms
tend to take many iterations, so EOS’s policy cache can
help them avoid costly tuning when an optimal setting
already exists in the cache.

Another approach is model-based performance tuning,
where researchers construct a performance model for a
real system, train the model parameters with offline
simulation, and apply the model online to select optimal
parameters. However, model construction typically requires
deep expertise and is time consuming. One case
study took over a year to optimize four parameters in
Berkeley DB [45]. EOS’s policy cache can be viewed as
a weak, automatically constructed model that captures a
partial mapping from workloads to good parameter
settings.

Ad hoc parameter caching. Three prior systems touch
upon the idea of caching good parameter settings for
certain workloads, but in an ad hoc way. One system
caches good intermediate “genes” to speed up genetic
algorithms. Linux caches congestion control window
sizes based on IP addresses. Another system
caches the optimal number of virtual machines allocated
for an online service based on hardware performance
counters. The caching in these systems is limited to a
particular algorithm or parameter, whereas EOS supports
generalized policy caching.

9 Conclusion 

We have presented EOS, a system that bridges the
knowledge gap between kernel developers and users by
automatically evolving the policies and parameters in vivo
on users’ real, production workloads. It provides a simple
policy specification API for kernel developers to
programmatically describe how the policies and parameters
should be tuned, a policy cache to make in-vivo tuning easy
and fast by memoizing good parameter settings for past
workloads, and a hierarchical search framework to
effectively search the parameter space. Evaluation of EOS
on four main Linux subsystems shows that (1) it is easy to
use, requiring 16 to 50 lines of code to specify the
policies in each subsystem, and (2) it effectively improves
each subsystem’s performance by an average of 1.24 to 2.58
times and sometimes over 13 times.